
Add stagingModeAutoFlushThreshold for staging mode batch control #26577

Merged
anthony-murphy merged 30 commits into microsoft:main from anthony-murphy-agent:staging-batch-control on Mar 25, 2026

Conversation

@anthony-murphy-agent anthony-murphy-agent commented Feb 27, 2026

Summary

  • Add stagingModeAutoFlushThreshold option to ContainerRuntimeOptions that controls automatic batch flushing during staging mode
  • When in staging mode, suppress turn-based/async flush scheduling until the accumulated batch reaches the threshold op count
  • Default threshold: 1000 ops (tied to largeBatchThreshold constant) — incoming ops always break the batch regardless
  • Wrap exitStagingMode in a PerformanceEvent reporting duration, exit method, batch count, and batches at or over threshold
  • Remove the DisableFlushBeforeProcess kill-bit flag (split out to Remove DisableFlushBeforeProcess feature flag #26770)

Default Justification (from production telemetry)

  • Copy-paste operations routinely produce batches of 1000+ ops (435K instances over 30 days via GroupLargeBatch telemetry)
  • All observed large batches are non-reentrant single-turn batches from normal user actions (not reconnection replay — reconnect preserves batch boundaries)
  • Receivers on modern Fluid versions (2.74+) handle 1000-op batches without jank (p99 processing duration ~5ms)
  • 1000 matches the existing "large batch" telemetry threshold in OpGroupingManager
  • The threshold only affects cross-turn accumulation; single-turn operations (like paste) are unaffected

Key Design Points

  • Only affects scheduleFlush() — direct flush() calls (incoming ops, connection changes, stashing, exit staging mode) bypass the threshold entirely
  • No effect outside staging mode
  • Exposed on the public ContainerRuntimeOptions interface (@legacy @beta) with forwardCompat: false type validation break acknowledged — consumers using Partial<ContainerRuntimeOptions> (the typical pattern via IContainerRuntimeOptions) are unaffected
  • Config override (Fluid.ContainerRuntime.StagingModeAutoFlushThreshold) > runtime option > default (1000)
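The precedence chain in the last bullet can be sketched as a small resolver. This is an illustrative sketch, not the actual Fluid Framework internals; the function and parameter names are assumptions, but the rule (config override > runtime option > default of 1000) is as stated above.

```typescript
// Hypothetical sketch of the precedence rule described above.
const DEFAULT_STAGING_MODE_AUTO_FLUSH_THRESHOLD = 1000;

function resolveAutoFlushThreshold(
	// Value of the Fluid.ContainerRuntime.StagingModeAutoFlushThreshold config, if set
	configOverride: number | undefined,
	// Value of ContainerRuntimeOptions.stagingModeAutoFlushThreshold, if set
	runtimeOption: number | undefined,
): number {
	// Config override wins over runtime option, which wins over the default.
	return configOverride ?? runtimeOption ?? DEFAULT_STAGING_MODE_AUTO_FLUSH_THRESHOLD;
}
```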

Telemetry

  • ExitStagingMode perf event: duration, exitMethod (commit/discard), autoFlushThreshold, batches, batchesAtOrOverThreshold
  • GroupLargeBatch threshold changed from >= to > so staging-mode auto-flush batches (exactly at threshold) don't trigger the event
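The second bullet amounts to a one-character comparison change. The sketch below is illustrative (the function name is an assumption; `largeBatchThreshold` is the shared constant this PR extracts), showing why a batch auto-flushed at exactly the threshold no longer fires the event.

```typescript
// Sketch of the GroupLargeBatch guard change: ">" instead of ">=".
const largeBatchThreshold = 1000; // shared constant, per this PR

function shouldEmitGroupLargeBatch(opCount: number): boolean {
	// Previously `opCount >= largeBatchThreshold`, which would have fired
	// for every staging-mode auto-flush batch sitting exactly at the threshold.
	return opCount > largeBatchThreshold;
}
```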

Test plan

  • Ops accumulate under threshold during staging mode
  • Ops flush when threshold is reached (with telemetry assertion)
  • Incoming ops break batch regardless of threshold
  • Incoming non-runtime ops break batch during staging mode
  • Reconnect breaks batch during staging mode
  • Exit staging mode (commit) flushes remaining ops
  • Exit staging mode (discard) flushes outbox before rollback
  • enterStagingMode flushes pending outbox as non-staged
  • No effect outside staging mode
  • Default threshold suppresses turn-based flushing during staging mode
  • Config override > runtime option > default precedence
  • Runtime option > default precedence
  • IdAllocation + reconnect while in staging mode (highest risk — b4e1fd1 interaction)
  • Reconnect resubmits pre-staged batches with threshold active
  • All 944 existing tests pass
  • Manual testing with Word integration

🤖 Generated with Claude Code

… mode

During staging mode, the runtime flushes ops into separate staged batches at
every JS turn boundary. This means consumers like Word that want to accumulate
ops across many turns into fewer, larger batches get fragmented results.

Add a `stagingModeMaxBatchOps` option to `ContainerRuntimeOptionsInternal` that
suppresses automatic (turn-based/async) flush scheduling during staging mode
until the accumulated batch reaches the specified op count. Incoming ops still
break the current batch regardless (they change the reference sequence number
via direct flush() calls that bypass scheduleFlush()).

Default: 1000 ops. This was chosen based on production telemetry analysis:
- Copy-paste operations routinely produce batches of 1000+ ops (435K instances
  of >=1000 ops observed over 30 days via GroupLargeBatch telemetry)
- All are non-reentrant single-turn batches from normal user actions
- Receivers on modern Fluid versions (2.74+) handle these without jank
  (p99 processing duration ~5ms for typical batches)
- 1000 matches the existing "large batch" telemetry threshold in OpGroupingManager
- The threshold only affects cross-turn accumulation; single-turn operations
  (like paste) are unaffected since all ops are submitted synchronously

Consumers can override: set to Infinity to only break batches on system events,
or to a lower value for tighter batch control.
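The suppression logic described above can be sketched as a predicate. Names here are illustrative, not the real `ContainerRuntime` internals; the behavior (no effect outside staging mode, suppress until the batch reaches the threshold, `Infinity` disables auto-flush entirely) follows the description above.

```typescript
// Hedged sketch of the scheduleFlush() gating described in this commit message.
interface OutboxStateSketch {
	inStagingMode: boolean;
	batchMessageCount: number;
}

function shouldScheduleAutoFlush(state: OutboxStateSketch, threshold: number): boolean {
	// Outside staging mode, turn-based flush scheduling is unaffected.
	if (!state.inStagingMode) {
		return true;
	}
	// In staging mode, suppress scheduling until the accumulated batch
	// reaches or exceeds the threshold. Direct flush() calls (incoming ops,
	// reconnect, exit staging mode) bypass this check entirely.
	return state.batchMessageCount >= threshold;
}
```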

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-murphy-agent and others added 2 commits February 26, 2026 16:24
The option controls when automatic flush scheduling kicks in, not a cap
on batch size. A batch can contain far more ops if a single synchronous
turn pushes many ops past the threshold (e.g. paste). The new name makes
it clear that only automatic/scheduled flushes are affected, not direct
flush calls from incoming ops, connection changes, or exit staging mode.

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address PR feedback:
- Field is now always `number` (not `number | undefined`), with the
  default applied at construction time
- Add config override via Fluid.ContainerRuntime.StagingModeAutoFlushThreshold
  for runtime tuning without code changes
- Config override takes precedence over runtime option, which takes
  precedence over the default (1000)

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI left a comment


Pull request overview

Adds an internal runtime option to control when automatic flush scheduling resumes during staging mode, allowing cross-turn accumulation of staged ops up to a configurable op-count threshold.

Changes:

  • Introduces stagingModeMaxBatchOps?: number on ContainerRuntimeOptionsInternal with a default of 1000 ops.
  • Updates ContainerRuntime.scheduleFlush() to suppress turn-based/async flush scheduling in staging mode until the threshold is reached.
  • Adds staging-mode threshold tests and excludes the option from doc-schema-affecting runtime options.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
packages/runtime/container-runtime/src/containerRuntime.ts Adds option + default constant and gates scheduleFlush() during staging mode based on accumulated op count.
packages/runtime/container-runtime/src/test/containerRuntime.spec.ts Adds tests intended to validate staging-mode batching behavior under/at threshold and with incoming ops.
packages/runtime/container-runtime/src/containerCompatibility.ts Omits stagingModeMaxBatchOps from doc-schema affecting runtime options.

- Fix comment accuracy: scheduleFlush threshold triggers at "reaches or
  exceeds", not just "exceeds"
- Fix maybeFlushPartialBatch comment: by default it throws on unexpected
  sequence number changes, only forces a flush when partial-batch flushing
  is enabled via Fluid.ContainerRuntime.DisableFlushBeforeProcess
- Strengthen threshold and incoming-op tests: assert that the outbox is
  actually emptied (mainBatchMessageCount drops to 0) rather than only
  checking that nothing was submitted to the wire

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

@anthony-murphy anthony-murphy changed the title Add stagingModeMaxBatchOps for staging mode batch control Add stagingModeAutoFlushThreshold for staging mode batch control Mar 15, 2026
@markfields
Member

I left some individual comments about logging in there, but thinking about it holistically, I think we want:

  • Log when this threshold is hit
  • Log when a batch is large in bytes (regardless of the main change here)
  • Be harmonious with existing "GroupLargeBatch" event (PS - Think about hosts that don't enable Grouped Batching)
  • Do we also want telemetry about large blobAttach batches? (regardless of the main change here)

@markfields
Member

@anthony-murphy Claude and I chatted about test cases (and I mentioned the ones you included in the PR description). Here's the test plan that came out of it, worth browsing, there are some important corner cases here:

stagingModeAutoFlushThreshold — Test Plan

Already covered

  • Ops accumulate under threshold
  • Ops flush when threshold reached
  • Incoming ops break batch regardless of threshold
  • Exit via commitChanges flushes remaining ops
  • No effect outside staging mode
  • Default threshold (1000) suppresses turn-based flushing

To add

1. Direct-flush codepaths break the batch during staging mode

These tests demonstrate that every codepath calling flush() directly still breaks
the accumulated batch, regardless of threshold. Each should verify: outbox is drained,
ops move to PSM as a staged batch, and no ops are sent to the wire.

1a. Incoming runtime op breaks batch

Already covered by existing test ("incoming ops break the batch regardless of
threshold"). Listed here for completeness.

1b. Incoming non-runtime op breaks batch

Same as 1a but with a non-runtime (signal/system) op arriving. The process() path
calls flush() unconditionally (line 3066) unless skipSafetyFlushDuringProcessStack
is set. Verify a non-runtime inbound message also drains the outbox mid-accumulation.

1c. Connection state change (reconnect) breaks batch

  1. Enter staging mode, submit ops (under threshold, sitting in outbox)
  2. Simulate disconnect then reconnect (canSendOps transitions true → false → true)
  3. The reconnect flush (line 2962) should drain the outbox before replayPendingStates
  4. Verify the accumulated ops became a staged batch in PSM
  5. Verify pre-staged ops are resubmitted correctly after reconnect

1d. enterStagingMode flushes any pending outbox contents

This covers the edge case where ops were submitted in the same JS turn before
enterStagingMode() is called (so a flush was scheduled but hasn't fired yet).

  1. Submit ops (not yet flushed — still in outbox)
  2. Call enterStagingMode() in the same turn
  3. Verify those ops were flushed as a non-staged batch (they predate staging mode)
  4. Submit more ops while in staging mode
  5. commitChanges() — verify only the post-entry ops are staged

2. Exit via discardChanges flushes outbox before rollback

Same shape as the existing commitChanges exit test but using discardChanges.
Verify outbox is drained and rolled-back ops match what was submitted.

3. IdAllocation + reconnect while in staging mode (b4e1fd1 interaction)

This is the highest-risk gap.

The fix in b4e1fd1 added scheduleFlush() after submitIdAllocationOpIfNeeded
during replayPendingStates to ensure the IdAllocation op is flushed before new ops
with different refSeqs arrive. With threshold suppression, that scheduleFlush() will
now return early if in staging mode and under threshold — potentially re-introducing
the original bug.

Test scenario:

  1. Enter staging mode
  2. Disconnect, generate a compressed ID (queued in idAllocationBatch)
  3. Reconnect — replayPendingStates submits IdAllocation op + calls scheduleFlush()
  4. Simulate remote op arriving (bumps refSeq)
  5. Generate 2nd compressed ID + submit a data store op
  6. Verify no outboxSequenceNumberCoherencyCheck error

If this test fails, the fix is to exempt the scheduleFlush() call in
replayPendingStates from threshold suppression (e.g., pass a force flag, or
call flush() directly instead of scheduleFlush()).
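The suggested remedy (a force flag) can be sketched as follows. This is hypothetical: the `force` parameter does not necessarily exist in the codebase; it simply illustrates how replayPendingStates could be exempted from threshold suppression if the test above fails.

```typescript
// Illustrative sketch of the proposed fix: a hypothetical `force` flag that
// lets replayPendingStates bypass staging-mode threshold suppression.
function scheduleFlushSketch(
	inStagingMode: boolean,
	batchMessageCount: number,
	threshold: number,
	force = false, // replayPendingStates would pass force = true
): boolean {
	if (force) {
		// IdAllocation ops must flush before ops with different refSeqs arrive.
		return true;
	}
	return !inStagingMode || batchMessageCount >= threshold;
}
```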

4. Reconnect resubmits pre-staged batches while threshold is active

Verify that ops submitted before entering staging mode are correctly resubmitted
on reconnect, and that the threshold does not interfere with their resubmission
(since resubmitted pre-staging batches go through replayPendingStates, not
scheduleFlush()).

5. Config override > runtime option > default

Single test: create runtime with both a config override and a runtime option set to
different values. Verify the config override wins. Then verify runtime option wins
over default when no config override is present.

The kill-bit switch for the flush-before-process simplification has been
in production long enough to confirm correctness. Remove the flag,
hardcode the default behavior (flush before process), and clean up the
partial-batch flushing code path that was only reachable when the flag
was enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-murphy and others added 10 commits March 18, 2026 13:56
Extract a shared largeBatchThreshold constant from OpGroupingManager and
use it for the staging-mode auto-flush default, so both values stay in
sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move stagingModeAutoFlushThreshold from ContainerRuntimeOptionsInternal
to the public ContainerRuntimeOptions interface so consumers can
configure it. Make it required (with a default of 1000) to match the
fully-required convention of the options interfaces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new unit tests for stagingModeAutoFlushThreshold:
- discardChanges flushes outbox before rollback
- enterStagingMode flushes pending outbox as non-staged
- config override > runtime option > default precedence (2 tests)
- incoming non-runtime op breaks batch during staging mode
- reconnect breaks batch during staging mode

Also fix ContainerLoadStats telemetry expectations to include
stagingModeAutoFlushThreshold, update type validation for the new
public API surface, and remove unused typeFromBatchedOp helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test 3 (highest risk): Verify that IdAllocation ops submitted during
replayPendingStates in staging mode are properly flushed by the "op"
handler before new ops with different refSeqs arrive, preventing the
outboxSequenceNumberCoherencyCheck error.

Test 4: Verify that pre-staged batches are correctly resubmitted on
reconnect while the threshold is active, and that staged changes can
still be committed afterward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit a telemetry event when the staging mode auto-flush threshold is
reached, including the threshold value and current batch message count.
This helps operators distinguish threshold-triggered flushes from
other flush causes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use PerformanceEvent.timedExec to measure exitStagingMode duration and
report autoFlushCount, autoFlushThreshold, and exitMethod. The perf
event is passed to the discardOrCommit callback so callers can add
properties in the future.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use PerformanceEvent.timedExec to measure exitStagingMode, reporting:
- exitMethod (commit/discard)
- autoFlushCount and autoFlushThreshold
- batches count and batchesOverThreshold (via reportProgress)

Both commit and discard paths return batchInfo arrays (deduplicated by
CSN) so exitStagingMode can compute batch stats uniformly. Also make
replayPendingStates return the replayed batchInfo array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The flag was removed in this PR, so the test passing it is now
redundant — it behaves identically to the remaining test which covers
flush-before-process behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@anthony-murphy anthony-murphy marked this pull request as ready for review March 19, 2026 00:31
@microsoft microsoft deleted a comment from azure-pipelines bot Mar 25, 2026
@github-actions

🔗 No broken links found! ✅

Your attention to detail is admirable.

linkcheck output


> fluid-framework-docs-site@0.0.0 ci:check-links /home/runner/work/FluidFramework/FluidFramework/docs
> start-server-and-test "npm run serve -- --no-open" 3000 check-links

1: starting server using command "npm run serve -- --no-open"
and when url "[ 'http://127.0.0.1:3000' ]" is responding with HTTP status code 200
running tests using command "npm run check-links"


> fluid-framework-docs-site@0.0.0 serve
> docusaurus serve --no-open

[SUCCESS] Serving "build" directory at: http://localhost:3000/

> fluid-framework-docs-site@0.0.0 check-links
> linkcheck http://localhost:3000 --skip-file skipped-urls.txt

Crawling...

Stats:
  272202 links
    1863 destination URLs
    2108 URLs ignored
       0 warnings
       0 errors


@anthony-murphy anthony-murphy self-requested a review March 25, 2026 23:59
@anthony-murphy anthony-murphy merged commit 5f9df7b into microsoft:main Mar 25, 2026
34 checks passed
@anthony-murphy anthony-murphy deleted the staging-batch-control branch March 25, 2026 23:59